String interning

In computer science, string interning is a method of storing only one copy of each distinct string value, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool.

String interning is supported by some modern object-oriented programming languages, including Python, Ruby (with its symbols), Java and .NET languages.

The single copy of each string is called its 'intern' and is typically looked up by a method of the string class, for example String.intern() in Java. All compile-time constant strings in Java are automatically interned using this method.[1]

Lisp, Ruby, and Smalltalk are among the languages with a symbol type that are basically interned strings.

Interned strings speed up string comparisons, which are sometimes a performance bottleneck in applications (such as compilers and dynamic programming language runtimes) that rely heavily on hash tables with string keys. Without interning, checking that two different strings are equal involves examining every character of both strings. This is slow for several reasons: it is inherently O(n) in the length of the strings; it typically requires reads from several regions of memory, which take time; and the reads fills up the processor cache, meaning there is less cache available for other needs. With interned strings, a simple object identity test suffices after the original intern operation; this is typically implemented as a pointer equality test, normally just a single machine instruction with no memory reference at all.

String interning also reduces memory usage if there are many instances of the same string value; for instance, it is read from a network or from storage. Such strings may include magic numbers or network protocol information. For example, XML parsers may intern names of tags and attributes to save memory.

Objects other than strings can be interned. For example, in Java, when primitive values are boxed into a wrapper object, certain values (any boolean, any byte, any char from 0 to 127, and any short or int between -128 and 127) are interned, and any two boxing conversions of one of these values are guaranteed to result in the same object.[2]

History

Lisp introduced the notion of interned strings for its symbols. Historically, the data structure used as a string intern pool was called an 'oblist' (when it was implemented as a linked list) or an 'obarray' (when it was implemented as an array).

Modern Lisp dialects typically distinguish symbols from strings; interning a given string returns an existing symbol or creates a new one, whose name is that string. Symbols often have additional properties that strings do not (such as storage for associated values, or namespacing): the distinction is also useful to prevent accidentally comparing an interned string with a not-necessarily-interned string, which could lead to intermittent failures depending on usage patterns.

References

External links